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Abstract 

^ ! We have developed a sophisticated statistical model for predicting the hitting per- 

^ I formance of Major League baseball players. The Bayesian paradigm provides a prin- 

CN ■ cipled method for balancing past performance with crucial covariates, such as player 

, age and position. We share information across time and across players by using mix- 

I ture distributions to control shrinkage for improved accuracy. We compare the per- 

' formance of our model to current sabermetric methods on a held-out season (2006), 

^ . and discuss both successes and limitations. 

<: 

a ■ 

1 Introduction and Motivation 

T— I ■ 

. There is substantial public and private interest in the projection of future hitting perfor- 

\0 '. mance in baseball. Major league baseball teams award large monetary contracts to top 

^ I free agent hitters under the assumption that they can reasonably expect that past success 

^ I will continue into the future. Of course, there is an expectation that future performance 

(3 ■ will vary, but for the most part it appears that teams are often quite foolishly seduced 

^ ■ by a fine performance over a single season. There are many questions: How should past 

• • ! consistency be balanced with advancing age when projecting future hitting performance? 

. ^ i In young players, how many seasons of above-average performance need to be observed 

^ I before we consider a player to be a truly exceptional hitter? What is the effect of a single 

■ sub-par year in an otherwise consistent career? We will attempt to answer these questions 
through the use of fully parametric statistical models for hitting performance. 

Modeling and prediction of hitting performance is an area of very acti ve research w ithin 
the baseba ll comrnunit y. Popular current methods include PECOTA ( Silver , 2003h and 



MARCEL (|Tangol. |2004|) . PECOTA is an extremely sophisticated method for prediction of 



future performance that is based on matching a player's past career performance to the 
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careers of a set of comparable major league ballplayers. For each player, their set of com- 
parable players is found by a nearest neighbor analysis of past players (both minor and 
major league) with similar performance at the same age. Once a comparison set is found, 
the future performance prediction for the player is based on the historical performance of 
those past comparable players. Factors such as park effects, league effects and physical 
attributes of the player are also taken into account. MARCEL is a very simple system for 
prediction that uses the most recent three seasons of a players major league data, with the 
most recent seasons weighted more heavily. There is also a component of the MARCEL 
methodology that regresses predictions to the overall population mean. These methods 
serve as effective tools for prediction, but our focus is on a fully model-based approach 
that integrates mu ltiple sources o f information from publicly available data, such as the 
Lahman database ( Lahmaiil, 2006h . One advantage of a model-based approach is the abil- 



ity to move beyond point predictions to the incorporation of variability via predictive 
intervals. 

In Section |2l we present a Bayesian hierarchical model for the evolution of hitting per- 
formance throughout the careers of individual players. Bayesian or Empirical Bayes ap- 
proaches have recently been used to rnqdel in dividual hitting events based on various 
within-gam e covariates ( Ouintana et al , 2008 b and for prediction of within-season per- 



formance (Brownl, 20081) . We are addressing a different question: how can we predict the 



course of a particular hitters career based on the seasons of information we have observed 
thus far? Our model includes several covariates that are crucial for the accurate predic- 
tion of hitting for a particular player in a given year. A player's age and home ballpark 
certainly has an influence on their hitting; we will include this information among the co- 
variates in our model. We will also include player position in our model, since we believe 
that position is an important proxy for hitting performance {e.g., second basemen have 
a generally lower propensity for home runs than first basemen). Finally, our model will 
factor past performance of each player into future predictions. In Section |3l we test our 
predictions against a hold out data set, and compare our performance with several com- 
peting methods. As mentioned above, current external methods focus largely on point 
prediction with little attention given to the variance of these predictions. We address this 
issue by examining our results not only in terms of accuracy of our point predictions, 
but also the quality the prediction intervals produced by our model. We also investigate 
several other interesting aspects of our model in Section |3] and then conclude with a brief 
discussion in Section HI 



2 Model and Implementation 



Our data comes from the publicly- available Lahman Baseball Database (" Lahman , 2006), 



which contains hitting totals for each major league baseball player from 1871 to the present 
day, though we will fit our model using only seasons from 1990 to 2005. In total, we have 
10280 player-years of of information from major league baseball between 1990 and 2005 
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that will be used for model estimation. Within each season j, we will use the following 
data for each player i: 



1. Home Run Total : Yij 

2. At Bat Total : Mij 

3. Age : Aij 

4. Home Ballpark : Bij 

5. Position : Rij 

As an example, Barry Bonds in 2001 had Yij = 73 home-runs out of Mij = 664 at bats. 
We excluded pitchers from our model, leaving us with nine positions: first basemen (IB), 
second basemen (2B), third basemen (3B), shortstop (SS), left fielder (LF), center fielder 
(CF), right fielder (RF), catcher (C), and the designated hitter (DH). There were 46 different 
home ballparks used in major league baseball between 1990 and 2005. Player ages ranged 
between 20 and 49, though the vast majority of player ages were between 23 and 44. 



2.1 Hierarchical Model for Hitting 

Our outcome of interest for a given player i in a given year (season) j is their home-run 
total Yij, which we model as a Binomial variable: 

Yij ~ Binomial(Mjj, ^jj) (1) 

where 9ij is a player- and year-specific home run rate, and Mij are the number of oppor- 
tunities (at bats) for player i in year j. Note that by using at-bats as our number of op- 
portunities, we are excluding outcomes such as walks, sacrifice flies and hit-by-pitches. 
We will assume that the number of opportunities Mij are fixed and known so we focus 
our efforts on modeling each home run rate 9ij. The i.i.d. assumption underlyin g the bi- 



nomi al model has already been justified for hitting totals within a single season (|Brownl, 



20081) , and so seems reasonable for hitting totals across an entire season. 



We next model each unobserved player-year rate 6ij as a function of home ballpark b = 
Bij, position k = Rij and age Aij of player i in year j. 

log (r^^) = + + (2) 

The parameter vector a = (ai, . . . , ag) are the position-specific intercepts for each of the 
nine player positions. The function fkiAij) is a smooth trajectory of Aij, that is different 
for each position k. We allow a flexible model for /fc(-) by using a cubic B-spline (|de Boom, 
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19781) with different spline coefficients 7 estimated for each position. The age trajectory 
component of this model involves the estimation of 36 parameters: four B-spline coeffi- 
cients per position x nine different positions. 



We call the parameter vector ^ the "team effects" since these parameters are shared by 
all players with the same team and home ballpark. However, these coefficients can not 
be interpreted as a true "ballpark effect" since they are confounded with the effect of the 
team playing in that ballpark. If a particular team contains many home run hitters, then 
that can influence the effect of their home ballpark. Separating the effect of team versus 
the effect of ballpark would require examining hitting data at the game level instead of 
the seasonal level we are using for our current model. 

There are two additional aspects of hitting performance that are not captured by the 
model outlined in ©-©• Firstly, conditional on the covariates age, position, and ball- 
park, our model treats the home run rate Oij as independent and identically-distributed 
across players i and years j. However, we suspect that not all hitters are created equal: 
we posit that there exists a sub-group of elite home run hitters within each position that 
share a higher mean home run rate. We can represent this belief by placing a mixture 
model on the intercept term dictated by a latent variable E^j in each player-year. In 
other words, 

^ r Ofeo if = 
\ aui if Ejj = 1 

where we force a^o < ati for each position k. We call the latent variable Ej^ the elite status 
for player i in year j. Players with elite status are modeled as having the same shape to 
their age trajectory, but with an extra additive term (on the log-odds scale) that increases 
their home run rate. However, we have a different elite indicator Ejj for each player- 
year, which means that a particular player i can move in and out of elite status during 
the course of his career. Thus, the elite sub-group is maintained in the player population 
throughout time even though this sub-group will not contain the exact same players from 
year to year. 

The second aspect of hitting performance that needs to be addressed is that the past per- 
formance of a particular player should contain information about his future performance. 
One option would be to use player-specific intercepts in the model to allow each player 
to have a different trajectory. However, this model choice would involve a large number 
of parameters, even if these player-specific intercepts were assumed to share a common 
prior distribution. In addition, many of these intercepts would be subject to over-fitting 
due to small number of observed years of data for many players. We instead favor an 
approach that involves fewer parameters (to prevent over-fitting) while still allowing 
different histories for individual players. We accomplish this goal by building the past 
performance of each player into our model through a hidden Markov model on the elite 
status indicators Ej^ for each player i. Specifically, our probability model of the elite status 
indicator for player i in year j + 1 is allowed to depend on the elite status indicator for 
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Figure 1: Hidden Markov Model for Elite Status 

Non-Elite Elite 




player i in year j: 

p(Eij+i = b\Eij = a, Rij = k) = Vahk a, 6 G {0, 1} (3) 

where Ejj is the elite status indicator and Rij is the position of player i in year j. This rela- 
tionship is also graphically represented in Figure [H The Markovian assumption induces 
a dependence structure on the home run rates Oi j over time for each player i. Players 
that show elite performance up until year j are more likely to be predicted as elite at year 
j + 1. The transition parameters Uk = {i^ook, i^oik, i^iok, i^uk) for each position k = 1, . . . , 9 
are shared across players at their position, but can differ between positions, which allows 
for a different proportion of elite players in each position. We initialize each player's 
Markov chain by setting Ejo = for all i, meaning that each player starts their career in 
non-elite status. This initialization has the desired consequence that young players must 
show consistently elite performance in multiple years in order to have a high probability 
of moving to the elite group. 

In order to take a fully Bayesian approach to this problem, we must specify prior distri- 
butions for all of our unknown parameters. The forty-eight different ballpark coefficients 
(3 in our model all share a common Normal distribution, 

A ~ Normal(0, r^) V / = !,..., 48 (4) 

The spline coefficients 7 needed for the modeling of our age trajectories also share a com- 
mon Normal distribution, 

jki ~ Normal(0,r2) V A; = 1, . . . , 9, / = 1, . . . , L (5) 

where L is the number of spline coefficients needed in the modeling of age trajectories for 
f{Aij, Rij) for each position. In our latent mixture model, we also have two intercept co- 
efficients for each position, ak = {ako, which share a truncated Normal distribution, 

ak ~ MVNormal(0, r^/g) ■ Ind(afco < "fci) V A; = 1, . . . , 9 (6) 

where is the 2x1 vector of zeros and I2 is the 2x2 identity matrix. This bivariate 
distribution is truncated by the indicator function Ind(-) to ensure that a^o < for each 
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position k. We make each of the prior distributions (ID)-© non-informative by setting 
the variance hyperparameter to a very large value (10000 in this study). Finally, for 
the position-specific transition parameters of our elite status i/, we use flat Dirichlet prior 
distributions, 

{mk,mk) ~ Dirichlet (cj,cj) V k = l,...,9 

(i^iofc, '^iifc) ~ Dirichlet (o'jt^) V A; = l,...,9 (7) 

These prior distributions are made non-informative by setting to a small value {to = 
1 in this study). We also examined other values for cu and found that using different 
values had no influence on our posterior inference, which is to be expected considering 
the dominance of the data in equation Combining these prior distributions together 
with equations ((I])-© give us the full posterior distribution of our unknown parameters, 

p{a,P,^,u,E\X) oc Y[piYij\Mij,0ij) ■ p{6ij\Rij,Aij,Bij,Eij,a,/3,^) 

id 

■p{Eij\Eij_i,i/) ■ p{a, 0,^,1/). (8) 
where we use X to denote our entire set of observed data Y and covariates {A, B, M, R). 



2.2 MCMC Implementation 



We e stimate our posterior distribution ® by a Gibbs sampling strategy (|Geman and Geman 



19841) . We iteratively sample from the following conditional distributions of each set of 



parameters given the current values of the other parameters: 

1. pia\P,%iy,E,X)=pia\P,^,E,X) 

2. p{P\a,j,^,E,X)=piP\a,'y,E,X) 

3. p{^\P,a,iy,E,X)=p{^\P,a,E,X) 

4. piu\0,j,a,E,X)=piu\E) 

5. piE\P,^,u,E,X) 

where again X denotes our entire set of observed data Y and covariates {A,B,M,R). 
Combined together, steps 1-3 of the Gibbs sampler represent the usual estimation of re- 
gression coefficients {a,P,j) in a Bayesian logistic regression model. The conditional 
posterior distributions for these coefficients are complicated and we employ the com- 
mon strategy_ofusingt^ Metropolis-Hastings algorithm to sample each coefficient (see. 



e.g. ICelman et all ( 20031) ). The proposal distribution for a particular coefficient is a Nor- 



mal distribution centered at the maximum likelihood estimate of that coefficient. The 
variance of this Normal proposal distribution is a tuning parameter that was adaptively 
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adjusted to provide a reasonable rejection /acceptance ratio ( Gelman et al , [l995| ). Step 



4 of the Gibbs sampler involves standard distributions for our transition parameters 
J^fc = (i^ook, ^mki ^iQki ^iik) for each position k = 1, . . . , 9. The conditional posterior dis- 
tributions for our transition parameters implied by (H) are 

(i^oofc, VQik) \E ~ Dirichlet (Noofc + t^, Noit + uj) 

{viik, viok) \E ~ Dirichlet (Nnfc + Niofc + u) (9) 
where '^abk = YlYl ^{^i,t = '^i,t+i = b) over all players i in position k and where rii 

i t=l 

represents the number of years of observed data for player i's career. Finally, step 5 of 
our Gibbs sampler involves sampling the elite status Eij for each year j of player i, which 
can be done usi ng the "For ward-summing Backward-sampling" algorithm for hidden 
Markov models ( Chib , 19961) . For a particular player i, this algorithm "forward-sums" by 



recursively calculating 

1 

cx p{Xi^t\Eit,&)xJ2pi^it\^i,t-i = e,e)-p{Ei^t-i = e\Xi^t-i,Q) (10) 



e=0 



for all t = 1, . . . , ni where Xi^k represents the observed data for player i up until year k, 
Xj fc represents only the observed data for player i in year k, and G represents all other 
parameters. The algorithm then "backward-samples" by sampling the terminal elite state 
Ei^m from the distribution p(Ej „- , 0) and then sampling Ej (_i|Ej ( for t = rii back to 
t = 1. Repeating this algorithm for each player i gives us a complete sample of our elite 
statuses E. We ran multiple chains from different starting values to evaluate convergence 
of our Gibbs sampler. Our results are based on several chains where the first 1000 itera- 
tions were discarded as bum-in. Our chains were also thinned, taking only every eighth 
iteration, in order to eliminate autocorrelation. 



2.3 Model Extension: Player-Specific Transition Parameters 

In Section 12.11 we introduced a hidden Markov model that allows the past performance 
of each player to influence predictions for future performance. If we infer player i to have 
been elite in year t (Ej ^ = 1), then this inference influences the elite status of that player 
in his next year, Ej t+i through the transition parameters u^- However, one potential limi- 
tation of these transition parameters Uk is that they are shared globally across all players 
at that position: each player at position k has the same probability of transitioning from 
elite to non-elite and vice versa. This model assumption allows us to pool information 
across players for the estimation of our transition parameters in Q, but may lead to loss 
of information if players are truly heterogeneous with respect to the probability of transi- 
tioning between elite and non-elite states. In order to address this possibility, we consider 
extending our model to allow player-specific transition parameters in our hidden Markov 
model. 
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Our proposed extension, which we call the PSHMM, has player-specific transition pa- 
rameters I/* = (z/qq, z/q^, uIq, for each player i, that share a common prior distribution. 



Ko^Ki) ~ Dirichlet (u;oofc , t^oife) 
'Iq) ~ Dirichlet {uuk , cuiofc) 



(11) 



where k is the position of player i. Global parameters Uk = (t^oofc, ^oik, i^iifc, ^wk) are now 
allowed to vary with flat prior distributions. This new hierarchical structure allows for 
transition probabilities to vary between players, but still imposes some shrinkage to- 
wards a common distribution controlled by global parameters iVk that are shared across 
players with position k. Under this model extension, the new conditional posterior dis- 
tribution for each z/* is 



where N 



ab 



t=i 



z/^o, z/*i) |E ~ Dirichlet (N^qq + ujook , N^^ + ujoik) 
E ~ Dirichlet (N*i + cuiifc , 

a,Ei^t+i = b). 



01 

i 
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(12) 



To implement this extended model, we must replace step 4 in our Gibbs sampler with a 
step where we draw i/* from (|T2)) for each player i. We must also insert a new step in our 
Gibbs sampler where we sample the global parameters ojk given our sampled values of 
all the I/* values for players at position k. This added step requires sampling {ujook, ^^oik) 
from the following conditional distribution: 



p{uJoOk,UJoik\l') oc 



rii^oofe + i^oifc) 
.r('^oofc)r('-^oifc) 





" Tift 






X 


n^oo 


X 


n-01 




j=i 




j=i 



(13) 



where each product is only over players i at position k and Uk is the number of players 
at position k. We accomplish this sampling by using a Metropolis-Hastings step with 



prop 



true distribution ((131) and Normal proposal distributions: Uqqi^ 
N(u)oifc, cr^)- The means of these proposal distributions are: 



N(cuoofc,o-^) and u;. 



prop 
01k 



^OOk — ^OQk 



^00k{^ — ^OOk) 



- 1 



with 



'Ofe 



and 



I^OIA: 



'1 — '^OOfc) 



^00k{^ — ^OOk) 



'OA; 



(14) 



^OOk 



/ Uk and sg^ = ^(z/qo - '^oofe)^ / «fc 



i=l 



i=l 



where each sum is over all players i at position k and is the number of players at 
position k. These estimates c^oofc and a)oifc were calculated by equating the sample mean 



z/Qofc and sample variance s^^ with the mean and variance of the Dirichlet distribution (|T3|) . 
Similarly, we sample (c^nfc, cuioa.) with the same procedure but with obvious substitutions. 
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3 Results and Model Comparison 



Our primary interest is the prediction of future hitting events, Y^\i for years t+1, . . . based 
on our model and observed data up to year t. We estimate the full posterior distribution 
((HI) and then use this posterior distribution to predict home run totals l^j*2006 ^or each player 
i in the 2006 season. The 2006 season serves as an external validation of our method, since 
this season is not included in our model fit. We use our predicted home run totals 1^2oo6 
for the 2006 season to compare our performance to several previous methods (Section |3^ 
as well as evaluate several internal model choices (Section |3J|) . In Section l33l we present 
inference for other parameters of interest from our model, such as the position-specific 
age curves. 



3.1 Prediction of 2006 Home Run Totals: Internal Comparisons 

We can use our posterior distribution © based on data from MLB seasons up to 2005 to 
calculate the predictive distribution of the 2006 hitting rate ^i,2006 for each player i. 

P(^j,2006|-^) = J P(^j,2006|-Ri,2006) ^,2006) -Bj,2006) Ej,2006) C>!)^)7) 

■p(E,,2oo6 \Ei, u)p{a, 0, 7, u, \X) da d'y du dE (15) 

where X represents all observed data up to 2005. This integral is estimated using the 
sampled values from our posterior distribution p{a,(3^ 7, u, E.i \X) that were generated via 
our Gibbs sampling strategy. 

We can use the posterior predictive distribution (|15|) of each 2006 home run rate (^i.2006 to 
calculate the distribution of the home run total l^j*2006 for each player in the 2006 season. 

P(^i?2006l^) = y P(^i*2006l^i,2006, ^i,2006) ■ P(6'i,2006|^) C^6'j,2006 (16) 

However, the issue with prediction of home run totals is that we must also consider the 
number of opportunities Mj 2006- Since our overall focus has been on modeling home run 
rates (^t,2006/ we will use the true value of Afj^2oo6 for the 2006 season in equation ((161) . Using 
the true value of each Afj,2oo6 gives a fair comparison of the rate predictions (^j,2006 for each 
model choice, since it is a constant scaling factor. This is not a particularly realistic sce- 
nario in a prediction setting since the actual number of opportunities will not be known 
ahead of time. 

Based on the predictive distribution p(^j*2006l-^)' report either a predictive mean 

E(y-*2oo6 1-^) or a predictive interval such thatp(yj*2006 ^ > 0.80. We can examine 

the accuracy of our model predictions by comparing to the observed HR totals 2006 for 
the 559 players in the 2006 season, which we did not include in our model fit. We use the 
following three comparison metrics: 
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1. RMSE: root mean square error of predictive means: ^Yli.'^(Xt\2006\^) " ^1,2006)^/^^ 

2. Interval Coverage: fraction of 80% predictive intervals C* covering observed ^^ 2006 

3. Interval Width: average width of 80% predictive intervals C* 

In Table [H we evaluate our full model outlined in Section IZT] relative to a couple simpler 
modeling choices. Specifically, we examine a simpler version of our model without po- 
sitional information or the mixture model on the a coefficients. We see from Tabled] that 
our full model gives proper coverage and a substantially lower RMSE than the version of 
our model without positional information or the elite /non-elite mixture model. We also 
examine a truly simplistic strawman, which is to take last years home run totals as the 
prediction for this years home run totals (ie. l^j*2006 = ^,2005)- Since this strawman is only 
a point estimate, that comparison is made based solely on the RMSE. As expected, the rel- 
ative performance of this strawman model is terrible, with a substantially higher RMSE 
compared to our full model. Of course, this simple strawman alternative is rather naive 
and in Section |3^ we compare our performance to more sophisticated external prediction 
approaches. 

Table 1: Internal Comparison of Different Model Choices. Measures are calculated over 559 Players from 
2006 season. 



Model 


Coverage Average 
RMSE of 80% Interval 
Intervals Width 


Full Model 

No Position or Elite Indicators 
Strawman: F^^aooe = ^i,2005 
Player-Specific Transitions 


5.30 0.855 9.81 
6.87 0.644 6.56 
8.24 NA NA 
5.45 0.871 10.36 



We also considered an extended model in Section |23] with player-specific transition pa- 
rameters for the hidden Markov model on elite status, and the validation results from 
this model are also given in Table [H Our motivation for this extension was that allow- 
ing player-specific transition parameters might reduce the interval width for players that 
have displayed consistent past performance. However, we see that the overall predic- 
tion accuracy was not improved with this model extension, suggesting that there is not 
enough additional information in the personal history of most players to noticeably im- 
prove the model predictions. Somewhat surprisingly, we also see that the width of our 
80% predictive intervals are not actually reduced in this extended model. The reason is 
that, even for players with long careers of data, the player-specific transition parameters 
fit by this extended model are not extreme enough to force all sampled elite indicators 
-Si,2oo6 to be either or 1, and so the predictive interval is still wide enough to include both 
possibilities. 
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3.2 Prediction of 2006 Home Run Totals: External Comparisons 



Similarly to Section |3.1[ we use hold-out home run data for the 2006 season to evaluate 
our model p redictions compa red to the p redictions from two external methods, PECOTA 
(|Silvej.l2003 )and MARCEL (Tangd, I2OO4I ). both described in Section [U For a reasonable 



comparison set, we focus our external validation on hitters with an empirical home run 
rate of least 1 HR every 40 AB in at least one season up to 2005 (minimum of 300 AB in that 
season). This restriction reduces our dataset for model fitting down to 118 top home run 
hitters who all have predictions from the competing methods PECOTA and MARCEL. As 
noted above, our predicted home-run totals for 2006 are based on the true number of at 
bats for 2006. In order to have a fair comparison to external methods such as PECOTA or 
MARCEL, we also scale the predictions from these methods by the true number of at bats 
in 2006. 

Our approach has the advantage of producing the full predictive distribution of future 
observations (summarized by our predictive intervals). However, the external method 
MARCEL does not produce comparable intervals, so we only compare to other approaches 
in terms of prediction accuracy. We expand our set of accuracy measures to include not 
only the root mean square error (RMSE), but also the median absolute error (MAE). In 
addition to comparing the predictions from each method using overall error rates, we 
also calculated "% BEST" which is, for each method, the percentage of players for which 
the predicted home run total l^j*2006 the closest to the true home run total among all 
methods. Each of these comparison statistics are given in Table |2l In addition to giving 
these validation measures for all 118 players, we also separate our comparison for young 
players (age < 26 years) versus older players (age > 26 years). 



Table 2: Comparison of our model to two external methods on the 2006 predictions of 118 top home-run 
hitters. We also provide this comparison for only young players (age < 26 years) versus only older players 
(age > 26 years). 



Method 


All Players 

RMSE MAE % BEST 


Young Players 

RMSE MAE % BEST 


Older Players 

RMSE MAE % BEST 


Our Model 

PECOTA 

MARCEL 


7.33 4.40 41 % 
7.11 4.68 28 % 
7.82 4.41 31 % 


2.62 1.93 62% 
4.62 3.44 0% 
4.15 2.17 38% 


7.56 4.48 39% 
7.26 4.79 30% 
8.02 4.57 31% 



We see from Table |2l that our model is extremely competitive with the external methods 
PECOTA and MARCEL. When examining all 118 players, our has the smallest median 
absolute error and the highest "% Best" measure, suggesting that our predictions are 
superior on these absolute scales. Our superior performance is even more striking when 
we examine only the young (age < 26 years) players in our dataset. We have the best 
prediction on 62% of all young players, and for these young players, both the RMSE 
and MAE from our method is substantially lower than either PECOTA or MARCEL. We 
credit this superior performance to our sophisticated hierarchical approach that builds in 
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information via position as well as past performance. 

However, our method is not completely dominant: we have a larger root mean square er- 
ror than PECOTA for older players (and overall), which suggests that our model might be 
making large errors on a small number of players. Further investigation shows that our 
model commits its largest errors for players in the designated hitter (DH) position. This 
is somewhat expected, since our model seems to perform best for young players and DH 
is a position almost always occupied by an older player. Beyond this, the model appears 
to be over-shrinking predictions for players in the DH role, perhaps because this player 
position is rather unique and does not fit our model assumptions as well as the other po- 
sitions. However, the validation results are generally very encouraging for our approach 
compared to previous methods, especially among younger players where a principled 
balance of positional information with past performance is most advantageous. 

We further investigate our model dynamics among young players by examining how 
many years of observed performance are needed to decide that a player is an elite home 
run hitter. This question was posited in Section [T] and we now address the question using 
our elite status indicators Eij. Taking all 559 available players examined in Section |3Jl we 
focus our attention on the subset of players that were determined by our model to be in 
the elite group (P{Eij = 1) > 0.5) for at least two years in their career. For each elite home 
run hitter, we tabulate the number of years of observed data that were needed before they 
were declared elite. The distribution of the number of years needed is given in Figure |2l 
We see that although some players are determined to be elite based on just one year of 
observed data, most players (74%) need more than one year of observed performance to 
determine that they are elite home run hitters. In fact, almost half of players (46%) need 
more than two years of observed performance to determine that they are elite home run 
hitters. 

We also investigated our model dynamics among older players by examining the bal- 
ancing of past consistency with advancing age, which was also posited as a question in 
Section [U Specifically, for the older players (age > 35) in our dataset, we examined the 
differences between the 2006 HR rate predictions 6'i 2oo6 = E(^i,2oo6|-^) from our model 
versus the naive prediction based entirely on the previous year 6'i,2oo6 = ^i, 2005 /M, 2005- 
Is our model contribution for a player (which we define as the difference between our 
model prediction 6'^ 2006 and the naive prediction 6^4 2006) rnore a function of advancing age 
or past consistency of that player? Both age and past consistency (measured as the stan- 
dard deviation of their past home run rates) were found to be equally good predictors of 
our model contribution, which suggests that both sources of information are being evenly 
balanced in the predictions produced by our model. 

3.3 Age Trajectory Curves 

In addition to validating our model in terms of prediction accuracy, we can also examine 
the age trajectory curves that are implied by our estimated posterior distribution ©. We 
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Figure 2: Distribution of number of years of observed data needed to infer elite status (P(£'y = 1) > 0.5) 
among all players determined by our model to be elite during their career. 
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will examine these curves on the scale of the home run rate 9ij which is a function of age 
Aij, ball-park h, and elite status Ejj for player i in year j (with position k): 

^ _ exp [(1 - Yjjj) ■ ako + Ejj ■ aki + Pb + fkjAj)] ^-^y^ 
1 + exp [(1 - Eij) ■ Ofco + ■ Ofci + Pb + fkiAj)] 

The shape of these curves can differ by position k, ballpark b and also can differ between 
elite and non-elite status as a consequence of having a different additive effect ako vs. aki- 
In Figure |3l we compare the age trajectories for two positions, DH and SS, for both elite 
player-years {Eij = 1) vs. non-elite player-years {Eij = 0) for an arbitrary ballpark. Each 
graph contains multiple curves (100 in each graph), each of which is the curve implied 
by the sampled values (01,7) from a single iteration of our converged and thinned Gibbs 
sampling output. Examining the curves from multiple samples gives us an indication of 
the variability in each curve. 

We see a tremendous difference between the two positions DH and SS in terms of the 
magnitude and shape of their age trajectory curves. This is not surprising, since home-run 
hitting ability is known to be quite different between designated hitters and shortstops. 
In fact, DH and SS were chosen specifically to illustrate the variability between position 
with regards to home-run hitting. For the DH position, we also see that elite vs. non- 
elite status show a substantial difference in the magnitude of the home-run rate, though 
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Figure 3: Age Trajectories /fe( ) for two positions and elite vs. non-elite status. X-axis is age and Y-axis is 
Rate = e^. 




the overall shape across age is restricted to be the same by the fact that players of both 



statuses share the same fk{Aij) in equation (|T7|) . There is less difference between elite 
and non-elite status for shortstops, in part due to the lower range of values for shortstops 
overall. Not surprisingly, the variability in the curves grows with the magnitude of the 
home run rate. 

We also perform a comparison across all positions by examining the elite vs. non-elite 
intercepts {ao,ai) that were allowed to vary by position. We present the posterior dis- 
tribution of each elite and non-elite intercept in Figure HI For easier interpretation, the 
values of each a^Q and a^i have been transformed into the implied home run rate 9ij for 
very young (age = 23) players in our dataset. We see in Figure |4] that the variability is 
higher for the elite intercept in each position, and there is even more variability between 
positions. The ordering of the positions is not surprising: the comer outfielders and in- 
fielders have much higher home run rates than the middle tnfielder and centerfielder 
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positions. 



Figure 4: Distribution of the elite vs. non-elite intercepts (ap, cn) for each position. The distributions of each 
(aojCCi) are presented in terms of the home run rate dij for very young (age = 23) players. The posterior 
mean is given as a black dot, and the 95% posterior interval as a black line. 
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For a player at a specific position, such as DH, our predictions of his home run rate for 
a future season is a weighted mixture of elite and non-elite DH curves given in Figure |3l 
The amount of weight given to elite vs. non-elite for a given player will be determined by 
the full posterior distribution © as a function of that player's past performance. We illus- 
trate this characteristic of our model in more detail in Figure |5|by examining six different 
hypothetical scenarios for players at the 2B position. Each plot in Figure |5] gives several 
seasons of past performance for a single player, as well as predictions for an additional 
season (age 30). Predictions are given both in terms of posterior draws of the home run 
rate as well as the posterior mean of the home run rate. The elite and non-elite age trajec- 
tories for the 2B position are also given in each plot. We focus first on the left column of 
plots, which shows hypothetical players with consistently high (top row), average (mid- 
dle row), and poor (bottom row) past home run rates. We see in each of these left-hand 
plots that our posterior draws (gray dots) for the next season are a mixture of posterior 
samples from the elite and non-elite curves, though each case has a different proportion 
of elite vs. non-elite, as indicated by the posterior mean of those draws (black x). 

Now, what would happen if each of these players was not so consistent? In Section |T} 
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we asked about the effect of a single sub-par year on our model predictions. The plots 
in the right column show the same three hypothetical players, but with their most recent 
past season replaced by a season with distinctly different (and relatively poor) HR hitting 
performance. We see from the resulting posterior means in each case that only the aver- 
age player (middle row) has his predictions substantially affected by the one season of 
relatively poor performance. Despite the one year of poor performance, the player in the 
top row of Figure |5] is still considered to be elite in the vast majority of posterior draws. 
Similarly, the player in the bottom row of Figure |5] is going to be considered non-elite re- 
gardless of that one year of extra poor performance. The one season of poor performance 
has the most influence on the player in the middle row, since the model has the most 
uncertainty with regards to the elite vs. non-elite status of this average player. 



4 Discussion 



We have presented a sophisticated Bayesian hierarchical model for home run hitting 
among major league baseball players. Our principled approach builds upon informa- 
tion about past performance, age, position, and home ballpark to estimate the underlying 
home run hitting ability of individual players, while sharing information across players. 
Our primary outcome of interest is the prediction of future home run hitting, which we 
evaluated on a held ou t season of data (2006) . Whe n compared to the previous methods, 
PECOTA (Silveri|200i and MARCEL (lTangd.l20oi) . we perform well in terms of predic- 



tion accuracy, especially our "% BEST" measure which tabulates the percentage of players 
for which our predictions are the closest to the truth. Our approach does especially well 
among young players, where a principled balance of positional information with past 
performance seems most helpful. It is worth noting that our method is fully automated 
without any post-hoc adjustments, which could possibly have been used to improve the 
performance of competing methods. In addition, our method has the advantage of esti- 
mating the full posterior predictive distribution of each player, which provides additional 
information in the form of posterior intervals. Beyond our primary goal of prediction, our 
model-based approach also allows us to answer interesting supplemental questions such 
as the ones posed in Section [H 

We have illustrated our methodology using home runs as the hitting event since they are 
a familiar outcome that most readers can calibrate with their own anecdotal experience. 
However, our approach could easily be adapted to other hitting outcomes of interest, such 
as on-base percentage (rate of hits or walks) which has become a popular tool for evalu- 
ating overall hitting quality. Also, although our procedure is presented in the context of 
predicting a single hitting event, we can also extend our methodology in order to model 
multiple hitting outcomes simultaneously. In this more general case, there are several 
possible outcomes of an at-bat (out, single, double, etc.). Our units of observation for a 
given player i in a given year j is now a vector of outcome totals Yij, which can be mod- 
eled as a multinomial outcome: Yij ~ Multinomial(Mjj,0jj) where M^j are the number 
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of opportunities (at bats) for player i in year j and dij is the vector of player- and year- 
specific rates for each outcome. Our underlying model for the rates 9ij as a function of 
position, ball-park and past performance could be extended to a vector of rates 6ij. Our 
preliminary experience with this type of multinomial model indicate that single-event 
predictions (such as home runs) are not improved by considering multiple outcomes si- 
multaneously, though one could argue that a more honest assessment of the variance in 
each event would result from acknowledging the possibility of multiple events from each 
at-bat. 

An important element of our approach was the use of mixture modeling of the player 
population to further refine our estimated home run rates. Sophisticat ed statistical mod - 
els have been used previously to model the careers of baseball hitters ( Berry et al , 1999|) , 



but these approaches have not employed mixtures for the modeling of the player pop- 
ulation. Our internal model comparisons suggest that this mixture model component is 
crucial for the accuracy of our model, dominating even information about player position. 
Using a mixture of elite and non-elite players limits the shrinkage towards the popula- 
tion mean of consistently elite home run hitters, leading to more accurate predictions. 
Our fully Bayesian approach also allows us to investigate the dynamics of our elite status 
indicators directly, as we do in Section |3^ 

In addition to our primary goal of home run prediction, our model also estimates sev- 
eral secondary parameters of interest. We estimate career trajectories for both elite and 
non-elite players within each position. In addition to evaluating the dramatic differences 
between positions in terms of home run trajectories, our fully Bayesian model also has 
the advantage of estimating the variability in these trajectories, as can be seen in Fig- 
ure |3l It is worth noting that our age trajectories do not really represent the typical major 
league baseball career, especially at the higher values of age. More accurately, our tra- 
jectories represent the typical career conditional on the player staying in baseball, which 
is one reason why we do not see dramatic dropoff in Figure |3l Since our primary goal 
is prediction, the fact that our trajectories are conditional is acceptable, since one would 
presumably only be interested in prediction for baseball players that are still in the major 
leagues. However, if one were more interested in estimating unconditional trajectories, 
then a more sophisticated modeling of the drop-out/ censoring process would be needed. 

Our focus in this paper has been the modeling of home run rates 9ij and so we have made 
an assumption throughout our analysis that the number of plate appearances, or oppor- 
tunities, for each player is a known quantity. This is a reasonable assumption when retro- 
spectively estimating past performance, but when predicting future hitting performance 
the number of future opportunities is not known. In order to maintain a fair comparison 
between our method and previous approaches for prediction of future totals, we have 
used the future number of opportunities, which is not a reasonable strategy for real pre- 
diction. A focus of future research is to adapt our sophisticated hierarchical approach to 
the modeling and prediction of plate appearances Mij in addition to our current modeling 
of hitting rates 9ij. 
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Figure 5: Six different hypothetical scenarios for a player at the 2B position. Black curves indicate the elite 
and non-elite age trajectories for the 2B position. Black points represent several several seasons of past 
performance for a single player. Predictions for an additional season are given as posterior draws (gray 
points) of the home run rate and the posterior mean of the home run rate (black x). Left coluirm of plots 
gives hypothetical players with consistently high (top row), average (middle row), and poor (bottom row) 
past home run rates. Right column of plots show the same hypothetical players, but with their most recent 
past season replaced by a relatively poor HR hitting performance. 
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